A diagnostic test in English as a second language
should be a series of miniature tests on specific problems. Subscores 
in each area should be considered rather than a total score. The 
results should be used to probe mastery in an area rather than 
provide the means for comparing one student against another. The 
statistical reliability of the results does not necessarily depend on 
test length. The teacher should look at each item for each student 
rather than the score and should spend more time studying the 
analysis of each student's test. The criterion of the percent of
correct decisions may be a more meaningful measure than ascertaining
the traditional coefficients of reliability. Tables provide the
statistical data under consideration. (Va)
A Theoretical Contribution to ESL Diagnostic Test Construction



Charles H. Blatchford 
University of Hawaii



This paper considers the results of an experimental 40-item diagnostic
test dealing with 10 grammatical mistakes typically made by Chinese students;
the analysis focuses on the scores of these 10 mini-tests. The purpose of
the experiment was to calculate the reliability of the mini-tests and then
to determine how many items are needed to establish "good" reliability.

Two forms (A & B) were administered a week apart to 298 ESL students.
Validity of the mini-tests was checked by constructing a composition with
the same grammatical mistakes and asking the students to identify them.

Reliability coefficients (K-R #20) ranged from .67 to .91. The data
were then analyzed as if each mini-test in Form A had only 3 items, and then
only 2 items; r ranged from .61 to .87, and from .28 to .82, respectively.

From a different point of view, the optimum number of items may be
suggested by asking how much useful information is lost if a decision is
made on the basis of 2 items rather than 4. If the criterion is the student's
consistently good, or poor, performance from A to B, the degree of such
consistent performance is very stable whether based on 4, 3, or 2 items
per subtest.
"A Theoretical Contribution to ESL Diagnostic Test Construction"¹

¹Much of the content of this paper, which was presented at the TESOL
Convention, New Orleans, March 7, 1971, is derived from my Columbia University
dissertation, "Experimental Steps to Ascertain Reliability of Diagnostic
Tests in English as a Second Language" (Ann Arbor: University Microfilms,
1970, Order #70-18,785).

Charles H. Blatchford 
University of Hawaii 

This paper is addressed to some problems in diagnostic testing, and I 
should probably start out by defining just what a diagnostic test is. In 
TESL we usually think of A.L. Davis' "Diagnostic Test for Students of 
English as a Second Language"² as a prime example of a test in this category.

²A. L. Davis, "Diagnostic Test for Students of English as a Second
Language" (Washington: Educational Services, 1953, and now distributed by
McGraw-Hill).



The difficulty is that when the test is given, it most likely loses its 
diagnostic character, because its score is reported as a single number. 

First, then, my definition of a diagnostic test is functional, and
depends on the way scores are reported: whenever several part scores are
reported for a test, something more than that global concept of "English"
is being tested, and certain aspects are therefore diagnosed, no matter
whether the test is billed as an achievement test, a proficiency test, or
whatever. In other words, the degree to which a test is diagnostic depends
not so much on the purpose of the test as on the way in which scores are
analyzed. Let us consider TOEFL for a moment: TOEFL is usually considered
to be a proficiency test, and when its total score is considered by an
admissions officer, it can quite rightly be so classified. However, if one
looks at the five part-scores for reading comprehension, vocabulary, and so
on, the test is serving a diagnostic purpose, in that information about an
individual's particular strengths and/or weaknesses is obtained. That is,
we have specific information not on "English," but on certain abilities or
skills.


Second, my definition of the ideal diagnostic test is that it be
criterion-referenced, not norm-referenced. That is to say, one should look
at whether mastery of the content has taken place (comparison with a
criterion) rather than at how a student fares in relation to others
(comparison with a norm). Although I just cited TOEFL as one example of gross
diagnosis, it is a norm-referenced test, and the scores will not help
inform the classroom teacher about specific weaknesses. The Davis test, on
the other hand, is a criterion-referenced test. But unless the answer sheet
is very carefully studied, the test with its one score will not give the
teacher much information on strength or weakness. Usually, it is used as a
placement test since its score is translated into specifications of how
much more English a student should study. To summarize, first, a diagnostic
test should have subscores; and second, it should not even have a total
score, so that the temptation to make norms will be avoided.

In essence, a diagnostic test should be considered as a series of
miniature tests on specific problems. But as soon as one considers short
tests, there is the difficulty of statistical reliability, that index of
how stable an individual's performance is from one form of a test to
another. Reliability is felt to be dependent on test length: the longer
the test, the more reliable. But, with many tests, we cannot afford great
length. As Thorndike and Hagen put it, "Diagnostic testing faces a very
troublesome dilemma. How is the test to provide sufficient diagnostic detail,
and yet appraise each separate ability with sufficient reliability?"³

³Robert L. Thorndike and Elizabeth Hagen, Measurement and Evaluation
in Psychology and Education (New York: Wiley, 1961), p. 297.

To attack this problem of the reliability of miniature tests, an
experimental, untimed, 40-item instrument was constructed to test ten
grammatical problems, not general abilities. Examples of such problems
are the use of wish and the patterns its use requires; if and
"contrary-to-fact" conditions; the use of because and therefore as
connectives; the use of since, for, and ago; and so on. Each of these ten
grammatical problems was tested by four multiple-choice items, and the
options were based upon Chinese students' mistakes. For example, two of the
four items testing wish were as follows:

     I can never finish my work.  I wish I   (1) have more time.
                                             (2) to have more time.
                                             (3) could have more time.
                                             (4) have had more time.
                                             (9) I don’t know the answer.

     It takes an hour to get to school.
     I wish I   (1) could live nearer.
                (2) have lived nearer.
                (3) to live nearer.
                (4) live nearer.
                (9) I don’t know the answer.

Two of those testing for, since, and ago were as follows:

     I have been watching TV   (1) for an hour.
                               (2) since an hour.
                               (3) an hour ago.
                               (4) from an hour.
                               (9) I don’t know the answer.

     I have been living at 350 Main Street   (1) two years ago.
                                             (2) from two years.
                                             (3) for two years.
                                             (4) since two years.
                                             (9) I don’t know the answer.

It can be seen that the items are structurally similar, although the options
are given in different (randomized) order.

To 298 secondary and college foreign students, two forms of the test
were administered a week apart, so that a Pearson product-moment reliability
measure could be made. For each of the ten grammatical problems, there was
then a reliability coefficient. Such product-moment reliability ranged
from .37 (#2) to .79 (#6) as seen in Table 1.
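
As an illustration of the product-moment measure just described, the following is a minimal sketch that correlates Form A and Form B scores on one mini-test. The function name and the six score pairs are hypothetical and are not the study data; only the formula itself is the standard Pearson coefficient.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one form has no score variance")
    return cov / sqrt(var_x * var_y)

# Hypothetical mini-test scores (0-4 right) for six students; not the study data.
form_a = [4, 2, 3, 0, 4, 1]
form_b = [4, 1, 3, 1, 3, 2]
r = pearson_r(form_a, form_b)
```

With only five possible scores per mini-test, many students tie, which is one reason such coefficients can run low even when classification decisions are stable.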

Table 1 about here 



By Kuder-Richardson Formula 20 for internal consistency, the ten coefficients
ranged from .67 (#9) to .91 (#6). "Good" reliability is considered in the
.90's or high .80's.⁴

⁴David P. Harris, Testing English as a Second Language (New York:
McGraw-Hill, 1969), pp. 16-17.
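
For readers who wish to compute the internal-consistency figure themselves, here is a minimal sketch of Kuder-Richardson Formula 20 for right/wrong items. It assumes each examinee's answers are coded 1 (right) or 0 (wrong); the function name and the five-student data set are hypothetical and are not taken from the study.

```python
def kr20(responses):
    """Kuder-Richardson Formula 20 for rows of 0/1 item scores (one row per examinee)."""
    n = len(responses)
    k = len(responses[0])
    if k < 2:
        raise ValueError("KR-20 needs at least two items")
    # Proportion passing each item (p); the proportion failing is q = 1 - p.
    p = [sum(row[i] for row in responses) / n for i in range(k)]
    sum_pq = sum(pi * (1 - pi) for pi in p)
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    # Population variance of total scores, consistent with the p*q item variances.
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    if var_total == 0:
        raise ValueError("KR-20 is undefined when all total scores are equal")
    return (k / (k - 1)) * (1 - sum_pq / var_total)

# Hypothetical answers of five students to one four-item mini-test; not the study data.
responses_demo = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
reliability = kr20(responses_demo)
```

Dropping a column from each row and calling the function again mirrors the recalculation with three and then two items per mini-test.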



Table 2 about here 



The reliability figures were then recalculated on the miniature tests by
dropping one of the four items and thus considering each mini-test as having
only three items. Each reliability figure drops. Similarly, when each
mini-test was considered to have only two items, the coefficients dropped yet
again. The range of these coefficients was from .28 (#9) to .82 (#6).

Still, in many of these mini-tests, there is good internal consistency
reliability, or at least it can be considered to be good, when there are,
after all, only two items making up each test!
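
One way to put the two-item figures in perspective is to compare them with what the Spearman-Brown formula would predict from the four-item figures. That formula is a standard textbook relation between test length and reliability; it is not a calculation reported in this paper, and it assumes the items in a mini-test are roughly parallel. The sketch below applies it to the four-item Form A values for mini-tests 1 and 6 in Table 2; the function and variable names are illustrative.

```python
def spearman_brown(r, length_ratio):
    """Predicted reliability when a test is lengthened or shortened by length_ratio."""
    return length_ratio * r / (1 + (length_ratio - 1) * r)

# Four-item Form A K-R 20 values for mini-tests 1 and 6 (Table 2).
four_item = {1: 0.873, 6: 0.906}

# Halving the mini-test from four items to two corresponds to a ratio of 2/4.
predicted_two_item = {
    mini_test: spearman_brown(r, 2 / 4) for mini_test, r in four_item.items()
}
```

Under these assumptions the predictions come out near .77 and .83, against observed two-item values of .780 and .818 in Table 2, so for these two mini-tests the drop is about what shortening alone would lead one to expect.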
It may now be asked what these data say regarding the optimal number of
items per mini-test. It seems that for most purposes, where one is
interested in descriptions of, rather than decisions about, individuals, a
test of two items per problem tested may be sufficient.

From another point of view, the question of reliability can be
considered not in terms of either internal consistency or product-moment
coefficients. The question of how long the test should be may be rephrased
to ask how much useful information is lost if a diagnosis of a student’s
English is based on a mini-test of two items rather than four. To attack
this problem, let’s look at a hypothetical situation. Four correct
responses out of four will be classified as [+] and 3, 2, 1, or 0 right
as [-]. For example, if on Form A a student gets two items out of four
right, the student will be classified as [-] by this criterion. Should the
teacher decide to teach him another lesson on the given problem? Let's
say a decision to teach is made. If on Form B (given a week later but with
no intervening instruction) the student scores two out of four again
(classified as [-]), the correct decision was made. His performance was
consistent in a negative way [-,-]. Conversely, if a student got a score
of four on Form A (classified as [+]), and a four on Form B [+], and if
the decision not to teach more had been made, the consistency of his
performance [+,+] also corroborates the decision as being right, this time
in a positive way. Thus, similarity of performance, [+,+] or [-,-], is the
basis for determining whether the correct decision has been made.

Let us look at some of the data in this light. The first line in
Table 3 can be read as follows: 66 students who got four right on Form A
got four right on Form B; 127 who got less than four right on Form A got
less than four right on Form B. The students classified in these two cells,
[+,+] and [-,-], performed consistently from one testing to the next, and
for them a correct decision was made; that is, the [+,+] cell members
needed no further instruction, and the [-,-] cell members did. Correct
decisions were made for 193 cases, which are .647 of the total of 298.
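
The classification rule and the resulting proportion of correct decisions can be sketched in a few lines. The function names are illustrative; the counts are the four cells for mini-test 1 in Table 3, and the individual scores are rebuilt from those cells only for the sake of the example (any score below four stands in for the [-] class).

```python
def classify(score, n_items):
    """'+' for a perfect score on the mini-test, '-' otherwise (the criterion used here)."""
    return "+" if score == n_items else "-"

def proportion_consistent(scores_a, scores_b, n_items_a, n_items_b):
    """Share of examinees classified alike on both forms, i.e. [+,+] or [-,-]."""
    pairs = list(zip(scores_a, scores_b))
    same = sum(classify(a, n_items_a) == classify(b, n_items_b) for a, b in pairs)
    return same / len(pairs)

# Cell counts for mini-test 1, keyed by (Form A score, Form B score).
# A score of 0 is an arbitrary stand-in for "less than four right".
cells = {(4, 4): 66, (0, 0): 127, (0, 4): 13, (4, 0): 92}
scores_a, scores_b = [], []
for (a, b), count in cells.items():
    scores_a += [a] * count
    scores_b += [b] * count

share = proportion_consistent(scores_a, scores_b, 4, 4)  # 193 of 298 examinees
```

A teacher with real answer sheets would pass the actual Form A and Form B scores instead of the rebuilt lists.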



Table 3 about here 



Thus, if one had based his decisions just on Form A performance, his
decision would have been corroborated in 65% of the cases. Or, put another
way, assessments of a student’s knowledge based on Form A performance seem
to be borne out against the criterion of Form B performance in 65 out of
100 cases. The numbers in the other two cells indicate erroneous
assessment. Thirteen students who got less than four right [-] on Form A
performed perfectly on Form B, and 92 who performed perfectly [+] on Form A
got less than four right [-] on Form B. Their inconsistent performance
would have led to mistaken assessment and placement. In mini-tests one
through ten, the percentages of correct assessment range from 62% (#2) to
79% (#6). If one decided from chance alone, or if one had no prior
knowledge of the examinees, one would expect to be right 50% of the time.

The percentages just given thus improve decision making. If one decided
only on the basis of Form A, 53% (158 out of 298) would be classified [+];
if on the basis of Form B only, 27% (79 out of 298).

The figures and percentages just discussed are those for Form A when
four items constitute each mini-test. When the number on Form A is reduced
from four to three (as shown in the next column of Table 3), the percentage
of examinees performing consistently declines, but only very slightly. When
the number of items is further reduced to two, the percentage decreases a
maximum of five percentage points from what it was when the mini-test
comprised four items. And in set six, which generally appears to have the
best Kuder-Richardson Formula 20 reliability, there is even a tiny gain!

To summarize, when it comes to the percent of correct decisions, the shorter
mini-tests seem to give as much information as the full four items. The
median percent of correct decisions when the test is four items long is
.69, and when it is two items long, is also .69. It appears that the
additional two items do not provide much, if any, more information.
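
The comparison across mini-test lengths can be reproduced directly from the cell counts. The sketch below does so for mini-test 6, using the counts as read from Table 3; the names are illustrative, and the readings depend on how the printed table is deciphered (each set of four counts does sum to 298).

```python
# Cell counts for mini-test 6, keyed by the number of Form A items considered.
# Order within each tuple: (A[-] & B[+], A[+] & B[+], A[-] & B[-], A[+] & B[-]).
MINI_TEST_6 = {
    4: (32, 109, 127, 30),
    3: (31, 110, 122, 35),
    2: (19, 122, 115, 42),
    1: (14, 127, 101, 56),
}

def percent_correct(cells):
    """Proportion of examinees in the consistent cells, [+,+] and [-,-]."""
    minus_plus, plus_plus, minus_minus, plus_minus = cells
    total = minus_plus + plus_plus + minus_minus + plus_minus
    return (plus_plus + minus_minus) / total

by_length = {n_items: round(percent_correct(c), 2) for n_items, c in MINI_TEST_6.items()}
```

For this mini-test the two-item proportion is slightly higher than the four-item one, which is the tiny gain mentioned above.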

So much for the theoretical side. What about the practical? I assume
that since there are not many diagnostic tests, most are made by the
teacher. What does the information above mean for the teacher when he is
constructing a test?

1. I believe it means that with confidence he can use only two items per
problem and be fairly sure of his diagnosis.

2. I believe it means that he should look at each item for each student,
not using total scores. This procedure will obviously require much more
time, but unless it is followed, the time spent in testing is not really
worthwhile.

3. I believe it means that he can individualize instruction to a greater
extent if he is willing to spend more time in studying the analysis of each
student’s test. Such individualization will require the abandonment of set
ways. It will mean that he not give his pat diagnostic test at the beginning
of the term, generalize about total scores, and then proceed blithely with
the set syllabus. If that procedure is followed, both criteria for a
diagnostic test with which this paper was introduced are being discarded.

In conclusion, provided that test-makers follow the usual canons of
carefully constructing and pre-testing items, I believe the teacher can
trust the diagnostic nature of his results even if the mini-tests on each
grammatical problem contain only two items, or even only one, and if
sufficient time is spent looking at the test papers, not the score.
The criterion of the percent of correct decisions made is perhaps a more
meaningful measure than ascertaining traditional coefficients of reliability.






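Table 1

Product-Moment Reliability Coefficients When
Forms A and B Have x Items in Each Miniature Test

(N = 298)

Mini-
test      (1)      (2)      (3)      (4)      (5)      (6)      (7)

  1      .437     .420     .411     .361     .418     .401     .398
  2      .374     .369     .363     .264     .383     .371     .292
  3      .445     .435     .406     .329     .4?3     .381     .315
  4      .601     .576     .512     .358     .595     .581     .438
  5      .620     .595     .586     .503     .627     .627     .579
  6      .785     .759     .761     .666     .764     .744     .680
  7      .462     .470     .458     .373     .455     .323     .173
  8      .616     .586     .548     .525     .556     .635     .642
  9      .671     .602     .531     .587     .660     .613     .408
 10     [.618]    .582     .596     .466     .601     .572    [.523]

Note: The column headings (combinations of Form A and Form B with x items
each) are only partly legible in the source, so the columns are numbered
here in their printed order; column (1) holds the coefficients for four
items on both forms, as cited in the text. Bracketed values appear
displaced at the end of the table in the source, and their placement in
row 10 is inferred. One digit in row 3, column (5) is illegible.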



Table 2

Kuder-Richardson Formula #20 Internal Consistency Reliability
When Forms A and B Have x Items in Each Miniature Test

(N = 298)

Mini-
test      A⁴       A³       A²       B⁴       B³       B²

  1      .873     .835     .780     .875     .832     .776
  2      .854     .798     .642     .726     .720     .628
  3      .786     .769     .654     .778     .732     .662
  4      .829     .750     .620     .797     .723     .574
  5      .862     .802     .754     .689     .696     .740
  6      .906     .870     .818     .909     .876     .774
  7      .794     .721     .590     .615     .534     .290
  8      .840     .777     .686     .685     .580     .680
  9      .670     .609     .276     .704     .583     .222
 10      .781     .705     .744     .848     .841     .774
Table 3

Consistency of Performance from Form A to Form B as Measured by
Numbers and Percents of Examinees Getting Specified Scores

(N = 298)

                              Number of Items in Form A Sets

                        4                  3                  2                  1
Mini-  Form B        # Right            # Right            # Right            # Right
test   Score      0-3     4    %*    0-2     3    %*    0-1     2    %*     0     1    %*
                  [-]   [+]          [-]   [+]          [-]   [+]          [-]   [+]

  1    4   [+]     13    66   .65     12    67   .64     11    68   .63      7    72   .58
       0-3 [-]    127    92          123    96          119   100          101   118

  2    4   [+]     82    76   .62     79    79   .63     77    81   .62     63    95   .60
       0-3 [-]    110    30          110    30          104    36           83    57

  3    4   [+]     43   127   .69     33   137   .69     29   141   .68     21   149   .65
       0-3 [-]     77    51           67    61           61    67           46    82

  4    4   [+]     26    84   .74     24    86   .74     20    90   .71      7   103   .56
       0-3 [-]    137    51          134    54          122    66           64   124

  5    4   [+]     19    57   .78     18    58   .77     15    61   .75     13    63   .71
       0-3 [-]    176    46          171    51          163    59          147    75

  6    4   [+]     32   109   .79     31   110   .78     19   122   .80     14   127   .77
       0-3 [-]    127    30          122    35          115    42          101    56

  7    4   [+]     24    72   .69      ?     ?   .69     16    80   .64      4    92   .51
       0-3 [-]    134    68          129    73          112    90           61   141

  8    4   [+]     18   153   .64     14   157   .64     11   160   .64      6   165   .63
       0-3 [-]     41    86           35    92           30    97           23   104

  9    4   [+]   [75]   172   .71     72   175   .73     66   181   .73      5   242   .86
       0-3 [-]     41    10           41    10           37    14           15    36

 10    4   [+]     63  [52]   .70     61    54   .69     29    86   .71     24    91   .64
       0-3 [-]    157    26          151    32          126    57          100    83

*% is the sum of the [-,-] and [+,+] cells divided by N.

Note: Bracketed counts are missing in the source and are inferred from
N = 298. For mini-test 7 with three items, the source shows only "121" in
the [+] row, and the two counts cannot be recovered.